# Spectral Networks and Deep Locally Connected Networks on Graphs

## Introduction

Convolutional Neural Networks (CNNs) have been extremely successful in machine learning problems where the coordinates of the underlying data representation have a grid structure (in 1, 2 and 3 dimensions), and the data to be studied in those coordinates has translational equivariance.

On a regular grid, a CNN is able to exploit several structures that play nicely together to greatly reduce the number of parameters in the system:

1. The translation structure, allowing the use of filters instead of generic linear maps and hence weight sharing.
2. The metric on the grid, allowing compactly supported filters, whose support is typically much smaller than the size of the input signals.
3. The multiscale dyadic clustering of the grid, allowing subsampling, implemented through stride convolutions and pooling.

If there are $n$ input coordinates on a grid in $d$ dimensions, a fully connected layer with $m$ outputs requires $n \cdot m$ parameters, which in typical operating regimes amounts to a complexity of $O(n^2)$ parameters. Using arbitrary filters instead of generic fully connected layers reduces the complexity to $O(n)$ parameters per feature map, as does using the metric structure by building a "locally connected" net [8, 17]. Using the two together gives $O(k \cdot S)$ parameters, where $k$ is the number of feature maps and $S$ is the support of the filters, and as a result the learning complexity is independent of $n$. Finally, using the multiscale dyadic clustering allows each successive layer to use a factor of $2^d$ fewer (spatial) coordinates per filter.
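The parameter counts above can be made concrete with a small calculation. The sketch below is illustrative only: the grid size, filter support, and number of feature maps are assumed values, not figures from the text.

```python
def fully_connected_params(n, m):
    """Generic linear map from n inputs to m outputs: n * m weights."""
    return n * m


def locally_connected_params(n, support, feature_maps=1):
    """An independent filter of size `support` at each of the n positions."""
    return feature_maps * support * n


def convolutional_params(support, feature_maps=1):
    """Weight sharing: one filter of size `support` reused at every position."""
    return feature_maps * support


n = 28 * 28  # input coordinates (assumed 28 x 28 grid)
S = 5 * 5    # filter support (assumed)
k = 32       # number of feature maps (assumed)

print(fully_connected_params(n, n))    # 614656, grows like n^2
print(locally_connected_params(n, S))  # 19600, grows like S * n per feature map
print(convolutional_params(S, k))      # 800, independent of n
```

Only the last count stays fixed as the input grows, which is the sense in which the learning complexity becomes independent of $n$.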

In many contexts, however, one may be faced with data defined over coordinates which lack some, or all, of the above geometrical properties. For instance, data defined on 3-D meshes, such as surface tension or temperature, measurements from a network of meteorological stations, or data coming from social networks or collaborative filtering, are all examples of structured inputs on which one cannot apply standard convolutional networks. Another relevant example is the intermediate representation arising from deep neural networks. Although the spatial convolutional structure can be exploited at several layers, typical CNN architectures do not assume any geometry in the "feature" dimension, resulting in 4-D tensors which are only convolutional along their spatial coordinates.

Graphs offer a natural framework to generalize the low-dimensional grid structure, and by extension the notion of convolution. In this work, we will discuss constructions of deep neural networks on graphs other than regular grids. We propose two different constructions. In the first one, we show that one can extend properties (2) and (3) to general graphs, and use them to define "locally" connected and pooling layers, which require $O(n)$ parameters instead of $O(n^2)$. We term this the spatial construction. The other construction, which we call the spectral construction, draws on the properties of convolutions in the Fourier domain. In $\mathbb{R}^d$, convolutions are linear operators diagonalised by the Fourier basis $\exp(i\omega \cdot t)$, $\omega, t \in \mathbb{R}^d$. One may then extend convolutions to general graphs by finding the corresponding "Fourier" basis. This equivalence is given through the graph Laplacian, an operator which provides a harmonic analysis on the graph [1]. The spectral construction needs at most $O(n)$ parameters per feature map, and also enables a construction where the number of parameters is independent of the input dimension $n$. These constructions allow efficient forward propagation and can be applied to datasets with a very large number of coordinates.
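To see why Laplacian eigenvectors are a reasonable stand-in for a Fourier basis, one can check the simplest case numerically. The sketch below assumes an unweighted cycle graph (a periodic 1-D grid) and the unnormalized Laplacian $L = D - W$; both are illustrative choices, not taken from the text. On that graph, each discrete Fourier mode $u_k(t) = \exp(2\pi i k t / n)$ is an eigenvector of $L$ with eigenvalue $2 - 2\cos(2\pi k / n)$.

```python
import cmath
import math


def cycle_laplacian(n):
    """Unnormalized Laplacian L = D - W of the unweighted cycle graph (n >= 3)."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 2.0
        L[i][(i + 1) % n] = -1.0
        L[i][(i - 1) % n] = -1.0
    return L


def fourier_mode(n, k):
    """Discrete Fourier mode u_k(t) = exp(2*pi*i*k*t/n) on n nodes."""
    return [cmath.exp(2j * math.pi * k * t / n) for t in range(n)]


def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def eigen_residual(n, k):
    """Largest entry of |L u_k - lambda_k u_k|, with lambda_k = 2 - 2 cos(2 pi k / n)."""
    u = fourier_mode(n, k)
    lam = 2.0 - 2.0 * math.cos(2.0 * math.pi * k / n)
    Lu = matvec(cycle_laplacian(n), u)
    return max(abs(a - lam * b) for a, b in zip(Lu, u))


# Every Fourier mode is an eigenvector, up to floating-point error.
print(max(eigen_residual(16, k) for k in range(16)) < 1e-9)
```

On a general graph no closed form is available, and the basis would instead come from a numerical eigendecomposition of the Laplacian.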

## 2 Spatial Construction

The most immediate generalisation of CNNs to general graphs is to consider **multiscale, hierarchical, local receptive fields**, as suggested in [3]. For that purpose, the grid will be replaced by a weighted graph $G = (\Omega, W)$, where $\Omega$ is a discrete set of size $m$ and $W$ is an $m \times m$ symmetric and nonnegative matrix.

### Locality

The notion of locality can be generalized easily in the context of a graph. Indeed, the weights in a graph determine a notion of locality. For example, a straightforward way to define neighborhoods on $W$ is to set a threshold $\delta > 0$ and take neighborhoods

$$N_\delta(j) = \{\, i \in \Omega : W_{ij} > \delta \,\}.$$

We can restrict attention to sparse "filters" with receptive fields given by these neighborhoods to get locally connected networks, thus reducing the number of parameters in a filter layer to $O(S \cdot n)$, where $S$ is the average neighborhood size.
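A minimal sketch of thresholded neighborhoods and the resulting parameter count follows. The 4-node weight matrix, its unit self-weights, and the threshold value are made up for illustration.

```python
def threshold_neighborhoods(W, delta):
    """For every node j, return the nodes i whose weight W[i][j] exceeds delta."""
    n = len(W)
    return [[i for i in range(n) if W[i][j] > delta] for j in range(n)]


def locally_connected_param_count(neighborhoods):
    """One free weight per (node, neighbor) pair, about S * n in total."""
    return sum(len(nb) for nb in neighborhoods)


# Hypothetical symmetric, nonnegative weights on 4 nodes (self-weights set to 1).
W = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.7, 0.1],
    [0.1, 0.7, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]

nbrs = threshold_neighborhoods(W, delta=0.5)
print(nbrs)                                 # [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
print(locally_connected_param_count(nbrs))  # 10, versus 16 for a dense 4 x 4 map
```

Here the average neighborhood size is $S = 2.5$, so the sparse filter uses $S \cdot n = 10$ weights.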

### Multiresolution Analysis on Graphs

CNNs reduce the size of the grid via pooling and subsampling layers. These layers are possible because of the natural multiscale clustering of the grid: they input all the feature maps over a cluster, and output a single feature for that cluster. On the grid, the dyadic clustering behaves nicely with respect to the metric and the Laplacian (and so with the translation structure). There is a large literature on forming multiscale clusterings on graphs; see for example [16, 25, 6, 13].

Finding multiscale clusterings that are provably guaranteed to behave well with respect to the Laplacian on the graph is still an open area of research. In this work we will use a **naive agglomerative method**. Figure 1 illustrates a multiresolution clustering of a graph with the corresponding neighborhoods.
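The text does not spell out the agglomerative method here, so the following is only one possible naive scheme, offered as an assumption: greedily pair each unassigned node with its most strongly connected unassigned neighbor, then sum edge weights between clusters to obtain the coarser graph. The function names and the example weights are hypothetical.

```python
def greedy_pair_clustering(W):
    """Greedily pair each unassigned node with its strongest unassigned neighbor.

    Returns a list of clusters (lists of node indices) that partition the nodes.
    """
    n = len(W)
    unassigned = set(range(n))
    clusters = []
    for i in range(n):
        if i not in unassigned:
            continue
        unassigned.remove(i)
        # Candidate partners: unassigned nodes joined to i by a positive weight.
        candidates = sorted(j for j in unassigned if W[i][j] > 0)
        if candidates:
            j = max(candidates, key=lambda c: W[i][c])
            unassigned.remove(j)
            clusters.append([i, j])
        else:
            clusters.append([i])
    return clusters


def coarsen_weights(W, clusters):
    """Weight between two clusters: sum of the edge weights between their members."""
    return [[sum(W[i][j] for i in a for j in b) for b in clusters] for a in clusters]


# Hypothetical weights on 4 nodes (zero diagonal).
W = [
    [0.0, 0.8, 0.1, 0.0],
    [0.8, 0.0, 0.7, 0.1],
    [0.1, 0.7, 0.0, 0.9],
    [0.0, 0.1, 0.9, 0.0],
]

clusters = greedy_pair_clustering(W)
print(clusters)  # [[0, 1], [2, 3]]
coarse = coarsen_weights(W, clusters)
```

Applying the two functions repeatedly to `coarse` would give a multiscale hierarchy in which each level roughly halves the number of nodes, loosely mimicking dyadic pooling on a grid.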

### Deep Locally Connected Networks

The spatial construction starts with a multiscale clustering of the graph, similarly as in [3]. We consider $K$ scales. We set $\Omega_0 = \Omega$, and for each $k = 1 \ldots K$, we define $\Omega_k$, a partition of $\Omega_{k-1}$ into $d_k$ clusters, and a collection of neighborhoods around each element of $\Omega_{k-1}$:

$$\mathcal{N}_k = \{\, \mathcal{N}_{k,i} \;;\; i = 1 \ldots d_{k-1} \,\}.$$

With these in hand, we can now define the $k$-th layer of the network. We assume without loss of generality that the input signal is a real signal defined in $\Omega_0$, and we denote by $f_k$ the number of "filters" created at each layer $k$. Each layer of the network will transform an $f_{k-1}$-dimensional signal indexed by $\Omega_{k-1}$ into an $f_k$-dimensional signal indexed by $\Omega_k$, thus trading off spatial resolution with newly created feature coordinates. More formally, the input to the $k$-th layer is $x_k = (x_{k,i};\ i = 1 \ldots f_{k-1})$, a $d_{k-1} \times f_{k-1}$ array, from which the layer produces its output $x_{k+1}$.
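The layer described above can be sketched as sparse filtering over neighborhoods, followed by a pointwise nonlinearity and pooling over clusters. The ordering of these steps, the use of ReLU and max-pooling, and the data layout (`x[i][p]`, `filters[j][i][p]`) are assumptions made for this illustration.

```python
def relu(v):
    """Rectified linear unit."""
    return max(0.0, v)


def local_layer(x, filters, neighborhoods, clusters):
    """One locally connected layer with pooling.

    x[i][p]          : input feature i at node p of the finer level
    filters[j][i][p] : weights aligned with neighborhoods[p], for output feature j
    neighborhoods[p] : indices of the nodes in the receptive field of node p
    clusters[c]      : indices of the finer-level nodes merged into cluster c
    Returns y[j][c]  : output feature j at cluster c of the coarser level.
    """
    d_in = len(neighborhoods)
    out = []
    for per_input in filters:  # one entry per output feature j
        # Sparse filtering: sum over input features and over each receptive field.
        filtered = [
            sum(
                w * x[i][q]
                for i, node_weights in enumerate(per_input)
                for w, q in zip(node_weights[p], neighborhoods[p])
            )
            for p in range(d_in)
        ]
        activated = [relu(v) for v in filtered]
        # Max-pooling maps the finer-level nodes onto the coarser-level clusters.
        out.append([max(activated[p] for p in cluster) for cluster in clusters])
    return out


neighborhoods = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
clusters = [[0, 1], [2, 3]]
x = [[1.0, -2.0, 3.0, 0.5]]                             # one input feature on 4 nodes
filters = [[[[1.0] * len(nb) for nb in neighborhoods]]]  # one output feature, unit weights

print(local_layer(x, filters, neighborhoods, clusters))  # [[2.0, 3.5]]
```

The output has one value per cluster rather than per node, which is the trade of spatial resolution for feature coordinates mentioned above.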

## 3 Spectral Construction

## 4 Relationship with Previous Work

There is a large literature on building wavelets on graphs; see for example [21, 7, 4, 5, 9]. A wavelet basis on a grid, in the language of neural networks, is a linear autoencoder with certain provable regularity properties (in particular, when encoding various classes of smooth functions, sparsity is guaranteed). The forward propagation in a classical wavelet transform strongly resembles the forward propagation in a neural network, except that there is only one filter map at each layer (and it is usually the same filter at each layer), and the output of each layer is kept, rather than just the output of the final layer. Classically, the filter is not learned, but constructed to facilitate the regularity proofs.

In the graph case, the goal is the same, except that the smoothness on the grid is replaced by smoothness on the graph. As in the classical case, most works have tried to construct the wavelets explicitly (that is, without learning), based on the graph, so that the corresponding autoencoder has the correct sparsity properties. In this work, and the recent work [21], the "filters" are constrained by construction to have some of the regularity properties of wavelets, but are also trained so that they are appropriate for a task separate from (but perhaps related to) the smoothness on the graph. Whereas [21] still builds a (sparse) linear autoencoder that keeps the basic wavelet transform setup, this work focuses on nonlinear constructions, and in particular tries to build analogues of CNNs.

Another line of work which is relevant to the present work is that of discovering grid topologies from data. In [19], the authors empirically confirm the statements of Section 3.3 by showing that one can recover the 2-D grid structure via second-order statistics. In [3, 12] the authors estimate similarities between features to construct locally connected networks.

### Multigrid

We could improve both constructions, and to some extent unify them, with a multiscale clustering of the graph that plays nicely with the Laplacian. As mentioned before, in the case of the grid, the standard dyadic cubes have the property that subsampling the Fourier functions on the grid to a coarser grid is the same as finding the Fourier functions on the coarser grid. This property would eliminate the annoying necessity of mapping the spectral construction to the finest grid at each layer to do the nonlinearity, and would allow us to interpret (via interpolation) the local filters at deeper layers in the spatial construction to be low frequency.
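The subsampling property of the grid can be checked directly in one dimension. The sketch below assumes a periodic 1-D grid with the usual discrete Fourier modes and a stride of 2. Keeping every second sample of mode $k$ on $n$ points gives exactly mode $k$ on $n/2$ points.

```python
import cmath
import math


def fourier_mode(n, k):
    """Discrete Fourier mode u_k(t) = exp(2*pi*i*k*t/n) on a periodic grid of n points."""
    return [cmath.exp(2j * math.pi * k * t / n) for t in range(n)]


def subsample(signal, stride=2):
    """Keep every `stride`-th sample (dyadic subsampling for stride 2)."""
    return signal[::stride]


def max_abs_diff(a, b):
    """Largest entrywise difference between two equal-length sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


n, k = 16, 3
fine = fourier_mode(n, k)         # mode k on the fine grid
coarse = fourier_mode(n // 2, k)  # mode k on the coarse grid

print(max_abs_diff(subsample(fine), coarse) < 1e-12)  # True
```

For a naive clustering of a general graph no such identity is guaranteed, which is why a clustering compatible with the Laplacian would be valuable.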

This kind of clustering is the underpinning of the multigrid method for solving discretized PDEs (and linear systems in general) [24]. There have been several papers extending the multigrid method, and in particular the multiscale clusterings associated with it, to settings more general than regular grids; see for example [16, 15] for situations as in this paper, and [24] for the algebraic multigrid method in general. In this work, for simplicity, we use a naive multiscale clustering in the spatial construction that is not guaranteed to respect the original graph's Laplacian, and no explicit spatial clustering in the spectral construction.

## Numerical Experiment

The previous constructions are tested on two variations of the MNIST data set. In the first, we subsample the normal 28 × 28 grid to get 400 coordinates. These coordinates still have a 2-D structure, but it is not possible to use standard convolutions. We then make a dataset by placing $d = 4096$ points on the 3-D unit sphere and projecting random MNIST images onto this set of points, as described in Section 5.2.
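The two coordinate sets can be sketched as follows. The sampling details are assumptions, since this section does not give them: the 400 grid coordinates are drawn uniformly at random without replacement, and the sphere points are drawn uniformly by normalizing Gaussian vectors. The projection of images onto the sphere is not shown.

```python
import math
import random


def subsample_grid(height=28, width=28, keep=400, seed=0):
    """Pick `keep` distinct pixel coordinates from a height x width grid."""
    rng = random.Random(seed)
    coords = [(r, c) for r in range(height) for c in range(width)]
    return sorted(rng.sample(coords, keep))


def random_sphere_points(count=4096, seed=0):
    """Draw `count` points uniformly on the unit sphere in 3-D.

    A normalized standard Gaussian vector is uniformly distributed on the sphere.
    """
    rng = random.Random(seed)
    points = []
    while len(points) < count:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:  # guard against a (vanishingly unlikely) zero vector
            points.append(tuple(c / norm for c in v))
    return points


grid_coords = subsample_grid()
sphere_points = random_sphere_points()
print(len(grid_coords), len(sphere_points))  # 400 4096
```

A graph over either coordinate set could then be built from pairwise distances, for example with a distance-based kernel; that choice is likewise left open here.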

In all the experiments, we use Rectified Linear Units as nonlinearities and max-pooling. We train the models with cross-entropy loss, using a fixed learning rate of 0.1 with momentum 0.9.
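For reference, the loss and the update rule implied by these settings can be sketched as below. The momentum convention (velocity accumulates `momentum * v - lr * grad`) is one common formulation and is an assumption, as the text does not state which variant was used.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def sgd_momentum_step(weights, grads, velocity, lr=0.1, momentum=0.9):
    """One SGD step with classical momentum; returns (new_weights, new_velocity)."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Two steps on a single weight with a constant gradient of 2.0.
w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, [2.0], v)  # velocity -0.2, weight 0.8
w, v = sgd_momentum_step(w, [2.0], v)  # velocity -0.38, weight 0.42
```

With a constant gradient the step size grows from one iteration to the next, which is the accelerating effect of the momentum term.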

### Subsampled MNIST

We first apply the constructions from Sections 3.2 and 2.3 to the subsampled MNIST dataset. Figure 3 shows examples of the resulting input signals, and Figures 4 and 5 show the hierarchical clustering constructed from the graph and some eigenfunctions of the graph Laplacian, respectively. The performance of various graph architectures is reported in Table 1. To serve as a baseline, we compute the standard Nearest Neighbor classifier, which performs slightly worse than on the full MNIST dataset (2.8%). A two-layer Fully Connected neural network reduces the error to 1.8%. The geometrical structure of the data can be exploited with the CNN graph architectures. Local Receptive Fields adapted to the graph structure outperform the fully connected network. In particular, two layers of filtering and max-pooling define a network which efficiently aggregates information to the final classifier. The spectral construction performs slightly worse on this dataset. We considered a frequency cutoff of $N/2 = 200$. However, the frequency smoothing architecture described in Section 3.4, which contains the smallest number of parameters, outperforms the regular spectral construction.

These results can be interpreted as follows. MNIST digits are characterized by localized oriented strokes, which require measurements with good spatial localization. Locally receptive fields are constructed to explicitly satisfy this constraint, whereas in the spectral construction the measurements are not forced to be spatially localized. Adding the smoothness constraint on the spectrum of the filters improves classification results, since the filters are then forced to have better spatial localization.

This fact is illustrated in Figure 6. We verify that Locally Receptive fields encode different templates across different spatial neighborhoods, since there is no global structure tying them together. On the other hand, spectral constructions have the capacity to generate local measurements that generalize across the graph. When the spectral multipliers are not constrained, the resulting filters tend to be spatially delocalized, as shown in panels (c)-(d). This corresponds to the fundamental limitation of Fourier analysis in encoding local phenomena. However, we observe in panels (e)-(f) that a simple smoothing across the spectrum of the graph Laplacian restores some form of spatial localization and creates filters which generalize across different spatial positions, as should be expected for convolution operators.
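The link between smooth spectral multipliers and spatial localization can be illustrated on a toy case. The sketch below assumes a cycle graph, whose Fourier basis is the DFT, rather than the graphs used in the experiments; the particular multipliers, radius, and random seed are arbitrary choices for illustration. It compares unconstrained (random) multipliers with multipliers that vary smoothly across frequency, measuring how much of each filter's energy lies near its center.

```python
import cmath
import math
import random


def spatial_filter(multipliers):
    """Inverse DFT of the multipliers: the filter's response to an impulse at node 0."""
    n = len(multipliers)
    return [
        sum(g * cmath.exp(2j * math.pi * k * t / n) for k, g in enumerate(multipliers)) / n
        for t in range(n)
    ]


def energy_near_origin(h, radius):
    """Fraction of the filter energy |h|^2 within `radius` hops of node 0 on the cycle."""
    n = len(h)
    total = sum(abs(v) ** 2 for v in h)
    near = sum(abs(h[t]) ** 2 for t in range(n) if min(t, n - t) <= radius)
    return near / total


n = 64
rng = random.Random(0)

# Unconstrained multipliers: one independent value per frequency.
rough = [rng.uniform(-1.0, 1.0) for _ in range(n)]
# Smooth multipliers: vary slowly (and periodically) across frequency.
smooth = [1.0 + math.cos(2.0 * math.pi * k / n) for k in range(n)]

rough_fraction = energy_near_origin(spatial_filter(rough), radius=2)
smooth_fraction = energy_near_origin(spatial_filter(smooth), radius=2)
print(rough_fraction, smooth_fraction)
```

The smooth multipliers here correspond to a filter supported on node 0 and its two immediate neighbors, so essentially all of its energy lies within the radius, while the random multipliers spread their energy across the whole cycle.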

## Conclusion

Using graph-based analogues of convolutional architectures can greatly reduce the number of parameters in a neural network without worsening (and often improving) the test error, while simultaneously giving faster forward propagation. These methods can be scaled to data with a large number of coordinates that have a notion of locality.

There is much to be done here. We suspect that with more careful training and deeper networks we can consistently improve on fully connected networks on "manifold-like" graphs such as the sampled sphere.

Furthermore, we intend to apply these techniques to less artificial problems, for example, on Netflix-like recommendation problems where there is a biclustering of the data and coordinates. Finally, the fact that smoothness on the naive ordering of the eigenvectors leads to improved results and localized filters suggests that it may be possible to make "dual" constructions with $O(1)$ parameters per filter in much more generality than the grid.